Skip to main content

13 分布式架构经典图书和论文

经典图书

  • Distributed Systems for fun and profit 以一种更易于理解的方式,讲述以亚马逊的 Dynamo、谷歌的 Bigtable 和 MapReduce 等为代表的分布式系统背后的核心思想
  • Designing Data Intensive Applications 本书深入浅出地用很多的工程案例讲解了如何让数据结点做扩展
  • Distributed Systems: Principles and Paradigms 介绍了分布式系统的七大核心原理,并给出了大量的例子
  • Scalable Web Architecture and Distributed Systems 主要针对面向互联网(公网)的分布式系统
  • Principles of Distributed Systems 讲述了多种分布式系统中会用到的算法

经典论文

分布式事务

  • 《Transaction Across DataCenter》(YouTube 视频)Google I/O 大会上的演讲

Paxos 一致性算法

一种基于消息传递且具有高度容错特性的一致性算法

  • Bigtable: A Distributed Storage System for Structured Data
  • The Chubby lock service for loosely-coupled distributed systems
  • The Google File System
  • MapReduce: Simplified Data Processing on Large Clusters
  • Neat Algorithms - Paxos

Raft 一致性算法

  • In search of an Understandable Consensus Algorithm (Extended Version)
  • Raft - The Secret Lives of Data
  • Raft Consensus Algorithm
  • Raft Distributed Consensus Algorithm Visualization

Gossip 一致性算法

  • Dynamo: Amazon’s Highly Available Key Value Store 讲述 Amazon 的 DynamoDB 是如何满足系统的高可用、高扩展和高可靠的
  • Time, Clocks and the Ordering of Events in a Distributed System 主要解决分布式系统中的时钟同步问题
  • 马萨诸塞大学课程 Distributed Operating System 中第 10 节 Clock Synchronization 讲述了时钟同步的问题
  • Why Vector Clocks are Easy 和 Why Vector Clocks are Hard Vector Clock相关
  • Efficient Reconciliation and Flow Control for Anti-Entropy Protocols 用来做数据同步的 Gossip 协议的原始论文
  • Understanding Gossip (Cassandra Internals) Gossip 协议也是 NoSQL 数据库 Cassandra 中使用到的数据协议
  • Gossip Visualization 关于 Gossip 的一些图示

分布式存储和数据库

  • Amazon Aurora: Design Considerations for High Throughput Cloud -Native Relation Databases
  • Spanner: Google’s Globally-Distributed Database
  • F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business
  • Cassandra: A Decentralized Structured Storage System
  • CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data

分布式消息系统

  • Kafka: a Distributed Messaging System for Log Processing
  • Wormhole: Reliable Pub-Sub to Support Geo-replicated Internet Services
  • All Aboard the Databus! LinkedIn’s Scalable Consistent Change Data Capture Platform

日志和数据

  • The Log: What every software engineer should know about real-time data’s unifying abstraction
  • The Log-Structured Merge-Tree (LSM-Tree)
  • Immutability Changes Everything
  • Tango: Distributed Data Structures over a Shared Log)

分布式监控和跟踪

  • Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

数据分析

  • The Unified Logging Infrastructure for Data Analytics at Twitter Twitter 公司的一篇关于日志架构和数据分析的论文
  • Scaling Big Data Mining Infrastructure: The Twitter Experience 讲 Twitter 公司的数据分析平台是怎么做的
  • Dremel: Interactive Analysis of Web-Scale Datasets 介绍了Google 公司的 Dremel 的架构与实现,以及它与 MapReduce 是如何互补的
  • Resident Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing 论文提出了弹性分布式数据集(Resilient Distributed Dataset,RDD)的概念

与编程相关的论文

  • Distributed Programming Model
  • PSync: a partially synchronous language for fault-tolerant distributed algorithms
  • Programming Models for Distributed Computing
  • Logic and Lattices for Distributed Programming

其它的分布式论文阅读列表

  • Services Engineering Reading List
  • Readings in Distributed Systems
  • Google Research - Distributed Systems and Parallel Computing